Abstract

EquiX is a search language for XML that combines the power of querying with the 
simplicity of searching. Requirements for such languages are discussed and it is shown
that EquiX meets the necessary criteria. Both a graphical abstract syntax and a formal 
concrete syntax are presented for EquiX queries. In addition, the semantics is defined and an 
evaluation algorithm is presented. The evaluation algorithm is polynomial under combined 
complexity. 

EquiX combines pattern matching, quantification and logical expressions to query both 
the data and meta-data of XML documents. The result of a query in EquiX is a set of XML 
documents. A DTD describing the result documents is derived automatically from the query. 

1 Introduction 

The widespread use of the World-Wide Web has given rise to a plethora of simple query processors,
commonly called search engines. Search engines query a database of semi-structured data,
namely HTML pages. Currently, search engines cannot be used to query the meta-data content 
in such pages. Only the data can be queried. For example, one can use a search engine to find 
pages containing the word "villain". However, it is difficult to obtain only pages in which villain 
appears in the context of a character in a Wild West movie. More and more XML pages are 
finding their way onto the Web. Thus, it is becoming increasingly important to be able to query 
both the data and the meta-data content of the pages on the Web. We propose a language for 
querying (or searching) the Web that fills this void. 

Search engines can be viewed as simple query processors. The query language of most
search engines is rather restricted. Both traditional database query languages, such as SQL,
and newly proposed languages, such as XQL [RLS98], XML-QL [DFF+98] and Xmas [BLP+98,
LPVV99], are much richer than the query language of most search engines. However, the limited
expressiveness of search engines appears to be an advantage in the context of the Web. Many
Internet users are not familiar with database concepts and find it hard to formulate SQL queries.
In comparison, experience has shown that even novice Internet users can easily pose queries
using a search engine. This is likely due to the inherent simplicity of search-engine query
languages.

Consequently, an apparent disadvantage of search-engine languages is really an advantage 
when it comes to querying the Web. Thus, it is imperative to first understand the requirements 
of a query language for the Web, before attempting to design such a language. We believe that 
the Web gives rise to a new concept in query languages, namely search languages. We will 
present design criteria for search languages. 



*Institute for Computer Science, The Hebrew University, Jerusalem 91904, Israel. 

^German Research Center for Artificial Intelligence GmbH, Stuhlsatzenhausweg 3, 66123 Saarbrücken,
Germany 

^Computer Science Dept., K. U. Leuven, Heverlee, Belgium 






As its name implies, a search language is a language that can be used to search for data. We 
differentiate between the terms search and query. Roughly speaking, a search is an imprecise 
process in which the user guesses the content of the document that she requires. Querying is a 
precise process in which the user specifies exactly the information she is seeking. In this paper
we define a language that has both searching and querying capabilities. We call a language that 
allows both searching and querying a search language. 

We call a query written in a search language a search query and the query result a search 
result. Similarly, we call a query processor for a search language a search processor. From
analyzing popular search engines, one can define a set of criteria that should guide the design
of a search language and processor. We present such criteria below.

1. Format of Results: A search result of a search query should be either a set of documents 
(pages) or sections of documents that satisfy the query. In general, when searching, the 
user is simply interested in finding information. Thus, a search query need not perform 
restructuring of documents to compute results. This simplifies the formulation of a search 
query since the format of the result need not be specified. 

2. Pattern Matching: A search language should allow some level of pattern matching both 
on the data and meta-data. Clearly, pattern matching on the data is a convenient way 
of specifying search requirements. Pattern matching on the meta-data allows a user to 
formulate a search query without knowing the exact structure of the document. In the 
context of searching, it is unlikely that the user will be aware of the exact structure of the 
document that she is seeking. 

3. Quantification: Many search languages currently implemented on the Web allow the 
user to specify quantifications in search queries. For example, the search query "+Wild -West",
according to the semantics of many of the search engines found on the Web,
requests documents in which the word "Wild" appears (i.e., exists) and the word "West" 
does not appear (i.e., not exists). The ability to specify quantifications should be extended 
to allow quantifications in querying the meta-data. 

4. Logical Expressions: Many search engines allow the user to specify logical expressions 
in their search languages, such as conjunctions and disjunctions of conditions. This should 
be extended to enable the user to use logical expressions in querying the meta-data. 

5. Iterative Searching Ability: The result of a search query is generally very large. Many 
times a result may contain hundreds, if not thousands, of documents. Users generally do 
not wish to sift through many documents in order to find the information that they require. 
Thus, it is a useful feature for a search processor to allow requerying of previous results. 
This enables users to search for the desired information iteratively, until such information 
is found. 

6. Polynomial Time: The database over which search queries are computed is large and 
is constantly growing. Hence, it is desirable for a search query to be computable in 
polynomial time under combined complexity (i.e., when both the query and the database 
are part of the input). 

When designing a search language, there is an additional requirement that is more difficult 
to define scientifically. A search language should be easy to use. We present our final criterion. 

7. Simplicity: A search language should be simple to use. One should be able to formulate queries
easily and the queries, once formulated, should be intuitively understandable. 
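
The quantification criterion (Criterion 3) can be illustrated with a small sketch of the "+word -word" convention mentioned above. This is only an illustration under simplifying assumptions: the function name `matches_keyword_query` is hypothetical, words are compared case-insensitively after splitting on whitespace, and real search engines differ in their exact semantics.

```python
def matches_keyword_query(query: str, text: str) -> bool:
    """Return True if `text` satisfies a '+word -word' style search query."""
    words = set(text.lower().split())
    for term in query.split():
        sign, word = term[0], term[1:].lower()
        if sign == "+" and word not in words:
            return False  # required word is missing ("exists" fails)
        if sign == "-" and word in words:
            return False  # forbidden word is present ("not exists" fails)
    return True


# "Wild" must appear and "West" must not appear.
print(matches_keyword_query("+Wild -West", "a wild river adventure"))
print(matches_keyword_query("+Wild -West", "life in the wild west"))
```

The first call succeeds because "wild" occurs and "west" does not; the second fails because "west" occurs. Extending such quantifiers from words to element and attribute names is what the criterion asks for.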
The definition of requirements for a search language is interesting in itself. In this paper
we present a specific language, namely EquiX, that fulfills the requirements 1 through 6. From
our experience, we have found EquiX search queries to be intuitively understandable. Thus, we 
believe that EquiX satisfies the additional language requirement of simplicity. EquiX is rather 
unique in that it combines both polynomial query evaluation (under combined complexity) with 
several powerful querying abilities. In EquiX, both quantification and negation can be used. In 
an extension to EquiX we allow aggregation and a limited class of regular expressions. Both 
searching and querying can be performed using the EquiX language. EquiX also simplifies the 
querying process by automatically generating both the format of the result and a corresponding 
DTD. 



This paper extends previous work [CKK+99]. In Section 2 we present a data model for
XML documents. Both the concrete and abstract syntax for EquiX queries are described in
Section 3. In Section 4 we define the semantics of EquiX, and in Section 5 a polynomial
algorithm for evaluating EquiX queries is presented. In Section 6 we present some extensions
to our language and in Section 7 we conclude. A procedure for computing a result DTD and a
proof of the correctness of our evaluation algorithm are presented in the appendices.

2 Data Model 



We define a data model for querying XML documents [BPSM98]. At first, we assume that each
XML document has a given DTD. In a later section we will relax this assumption. The term element
will be used to refer to a particular occurrence of an element in a document. The term element 
name will refer to a name of an element and thus, may appear many times in a document. 
Similarly we use attribute to refer to a particular occurrence of an attribute and attribute name 
to refer to its name. At times, we will blur the distinction between these terms when the meaning 
is clear from the context. 

We introduce some necessary notation. A directed tree over a set of nodes $N$ is a pair
$T = (N, E)$ where $E \subseteq N \times N$ and $E$ defines a tree-structure. We say that the edge $(n, n')$ is
incident from $n$ and incident to $n'$. Note that in a tree, there is at most one edge incident to
any given node. We assume throughout this paper that all trees are finite. A directed tree is
rooted if there is a designated node $r \in N$, such that every node in $N$ is reachable from $r$ in $T$.
We call $r$ the root of $T$. We denote a rooted directed tree as a triple $T = (N, E, r)$.

An XML document contains both data (i.e., atomic values) and meta-data (i.e., elements 
and attributes). The relationships between data and meta-data (and between meta-data and
meta-data) are reflected in a document by use of nesting.

We will represent an XML document by a directed tree with a labeling function. The data 
and meta-data in a document correspond to nodes in the tree with appropriate labels. Nodes 
corresponding to meta-data are complex nodes while nodes corresponding to data are atomic 
nodes. The relationships in a document are represented by edges in the tree. In this fashion, an 
XML document is represented by its parse tree. 

Note that using ID and IDREF attributes one can represent additional relationships between 
values. When considering these relationships, a document may no longer be represented by a 
tree. In the sequel we will utilize ID and IDREF attributes to answer search queries. 

In general, a parsed XML document need not be a rooted tree. An XML document that 
gives rise to a rooted tree is said to be rooted. The element that corresponds to the root of the 
tree is called the root element. Given an XML document that is not rooted, one can create a 
rooted document by adding a new element to the document and placing its opening tag at the 
beginning of the document, and its closing tag at the end of the document. This new element 
will be the root element of the new document. With little effort we can adjust the DTD of 
the original document to create a new DTD that the new document will conform to. Thus, we
assume without loss of generality that all XML documents in the database are rooted. 

We now give a formal definition of an XML document. We assume that there is an infinite
set $A$ of atoms and an infinite set $C$ of labels.

Definition 2.1 (XML Document) An XML document is a pair $X = (T, l)$ such that

• $T = (N, E, r)$ is a rooted directed tree¹;

• $l : N \to C \cup A$ is a labeling function that associates each complex node with a value in $C$
and each atomic node with a value in $A$.
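
The definition above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name `Node`, its fields, and the helper `walk` are hypothetical, and the ordering of children is simply list order.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass(eq=False)
class Node:
    label: str                     # a label (complex node) or an atom (atomic node)
    atomic: bool = False
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        if self.atomic:
            raise ValueError("atomic nodes are leaves")
        child.parent = self        # at most one edge is incident to a node
        self.children.append(child)
        return child


def walk(root: Node) -> Iterator[Node]:
    """Yield every node reachable from the root, parents before children."""
    yield root
    for child in root.children:
        yield from walk(child)


# A tiny rooted document: movieInfo -> movie -> title -> "The Lone Cowboy"
root = Node("movieInfo")
movie = root.add(Node("movie"))
title = movie.add(Node("title"))
title.add(Node("The Lone Cowboy", atomic=True))

print([n.label for n in walk(root)])
```

Complex nodes carry element or attribute names, while the single atomic node carries the data value; the pair of tree and labels corresponds to $X = (T, l)$.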

We assume that each DTD has a designated element name, called the root element name of
the DTD. Consider a DTD $d$ with a root element name $e$. We say that a document $X = (T, l)$
with root $r$ strictly conforms to $d$ if

1. the document $X$ conforms to $d$ (in the usual way [BPSM98]) and

2. the function $l$ assigns the label $e$ to the root $r$ (i.e., $l(r) = e$).

The following DTD with root element name movieInfo describes information about movies.

    <!ELEMENT movieInfo (movie+, actor+)>
    <!ELEMENT movie     (descr, title, character+)>
    <!ELEMENT actor     (name)>
    <!ATTLIST actor     id   ID    #REQUIRED>
    <!ELEMENT descr     (#PCDATA)>
    <!ELEMENT title     (#PCDATA)>
    <!ELEMENT name      (#PCDATA)>
    <!ELEMENT character EMPTY>
    <!ATTLIST character role CDATA #REQUIRED
                        star IDREF #REQUIRED>


In Figure 1 an XML document containing movie information is depicted. This document
strictly conforms to the DTD presented above.

Note that the nodes in Figure 1 are numbered. The numbering is for convenient reference
and is not part of the data model.

A catalog is a pair C = (d, S) where d is a DTD and S is a set of XML documents, each of 
which strictly conforms to d. A database is a set of catalogs. Note the similarity of this definition 
to the relational model where a database is a set of tuples conforming to given relation schemes. 
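
As a rough sketch of the catalog notion, the following code groups documents under the root element name of a DTD. It is only an approximation: the class `Catalog` and the sample document are hypothetical, and only the second condition of strict conformance (the root label) is checked, since the standard library parser used here does not validate against a DTD.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class Catalog:
    root_name: str                                   # root element name of the DTD d
    documents: List[ET.Element] = field(default_factory=list)  # the set S

    def add(self, xml_text: str) -> None:
        root = ET.fromstring(xml_text)
        # Only the root-label condition is checked; full DTD validation is omitted.
        if root.tag != self.root_name:
            raise ValueError(f"root element {root.tag!r} is not {self.root_name!r}")
        self.documents.append(root)


# Illustrative document; the ID value "a436" is made up for this sketch.
SAMPLE = """<movieInfo>
  <movie>
    <descr>takes place in the Wild West...</descr>
    <title>The Lone Cowboy</title>
    <character role="cowboy" star="a436"/>
  </movie>
  <actor id="a436"><name>Jack Robinson</name></actor>
</movieInfo>"""

catalog = Catalog("movieInfo")
catalog.add(SAMPLE)
database = [catalog]          # a database is a set of catalogs
print(len(catalog.documents))
```

A document whose root element is not movieInfo is rejected by `add`, mirroring the requirement that every document in a catalog strictly conforms to the catalog's DTD.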

This data model is natural and has useful characteristics. Our assumption that each XML
document conforms to a given DTD implies that the documents are of a partially known
structure. We can display this knowledge for the benefit of the user. Thus, the task of finding
information in a database does not require a preliminary step of querying the database to
discover its structure.

¹Note that an XML document is a sequence of characters. Thus, to properly model the ordering of elements
in a document, an ordering function on the children of a node should be introduced. For simplicity of exposition
we chose to omit this in the paper.

[Figure: tree representation of a movieInfo document with numbered nodes]

Figure 1: An XML Document

3 Search Query Syntax 

In this section we present both a concrete and an abstract syntax for EquiX search queries. A 
search query written in the concrete syntax is a concrete query and a search query written in 
the abstract syntax is an abstract query. 

3.1 Concrete Query Syntax 

The concrete syntax is described informally as part of the graphical user interface currently 
implemented for EquiX. Intuitively, a query is an "example" of the documents that should 
appear in the output. By formulating an EquiX query the user can specify documents that she 
would like to find. She can specify constraints on the data that should appear in the documents. 
We call such constraints content constraints. She can also specify constraints on the meta-data, 
or structure, of the documents. We call such constraints structural constraints. In addition, the 
user can specify quantification constraints which constrain the data and meta-data that should 
appear in the resulting documents by determining how the content and structural constraints 
should be applied to a document. 

The user formulates her query interactively. The user chooses a catalog $(d, S)$. Only
documents in $S$ will be searched (queried). At first a minimal query is displayed. In a minimal query,
only the root element name of $d$ is displayed. A minimal query looks similar to an empty form
for querying using a search engine (see Figure 2). The user can then add content constraints
by filling in the form, or add structural constraints by expanding elements that are displayed.
When an element is expanded, its attributes and subelements, as defined in $d$, are displayed.
The user can add content constraints to the elements and attributes. The user can also specify
the quantification that should be applied to each element and attribute, i.e., quantification
constraints. This can be one of exists, not exists, for all, and not for all (written in a user friendly
fashion). In addition, the user can choose which elements in the query should appear in the
output.

[Figure: minimal query form with the search condition "Wild West" and the quantifier "Exists"]

Figure 2: Minimal query that finds documents containing the phrase "Wild West"

In Figure 3 an expanded concrete query is depicted. This query was formulated by exploring
the DTD presented in Section 2. It retrieves the title and description of Wild West movies in
which Redford does not star as a villain. Intuitively, answering this query is a two part process:

1. Search for Wild West movies. The phrase "Wild West" may appear anywhere in the 
description of a movie. For example, it may appear in the title or in the movie description. 
Intuitively, this is similar to a search in a search engine. 

2. Query the movies to find those in which Redford does not play as a villain. This condition 
is rather exact. It specifies exactly where the phrases should appear and it contains a 
quantification constraint. Thus, conceptually, this is similar to a traditional database 
query. 



3.2 Abstract Query Syntax 

We will present an abstract syntax for EquiX and show how a concrete query is translated to
an abstract query.

A boolean function that associates each sequence of alpha-numeric symbols with a truth
value among $\{\bot, \top\}$ is a string matching function. We assume that there is an infinite set $C$
of string matching functions, that $C$ is closed under complement and that the function $\top$ is a
member of $C$. One such function might be:

$$c_{wild \wedge west}(s) = \begin{cases} \top & \text{if } s \text{ contains the words wild and west} \\ \bot & \text{otherwise} \end{cases}$$
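
String matching functions of this kind can be sketched as ordinary boolean functions on strings. The names `contains_all`, `complement` and `always_true` are hypothetical helpers introduced for this illustration, and the word-splitting rule (runs of word characters, case-insensitive) is an assumption.

```python
import re
from typing import Callable

StringMatcher = Callable[[str], bool]


def always_true(s: str) -> bool:
    """The constant function that accepts every string."""
    return True


def contains_all(*words: str) -> StringMatcher:
    """Build a matcher that holds when every given word occurs in the string."""
    wanted = {w.lower() for w in words}

    def match(s: str) -> bool:
        return wanted <= set(re.findall(r"\w+", s.lower()))

    return match


def complement(f: StringMatcher) -> StringMatcher:
    """Closure under complement: the matcher that holds exactly when f does not."""
    return lambda s: not f(s)


wild_and_west = contains_all("wild", "west")
print(wild_and_west("takes place in the Wild West..."))
print(complement(wild_and_west)("The Lone Cowboy"))
```

The `complement` helper is what makes negation in queries expressible without leaving the set of string matching functions.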



We define an abstract query below. 

Definition 3.1 (Abstract Query) An abstract query is a rooted directed tree $T$ augmented by
four constraining functions and an output set, denoted $Q = (T, l, c, o, q, O)$ where

• $l : N \to C$ is a labeling function that associates each node with a label;

• $c : N \to C$ is a content function that associates each node with a string matching function;

• $o : N \to \{\wedge, \vee\}$ is an operator function that associates each node with a logical operator;

• $q : E \to \{\exists, \forall\}$ is a quantification function that associates each edge with a quantifier;

• $O \subseteq N$ is the set of projected nodes, i.e., nodes that should appear in the result.

[Figure: concrete query form, reading as follows]

    Find each XML document that:
      Has a movieInfo that
        Has a movie that matches "Wild West" that
          Has a descr and that
          Has a title and that
          Does not have a character that
            Has a role that matches "villain" and that
            Has a star that matches "Redford"
        Has an actor

Figure 3: Query that finds titles and descriptions of movies in which Redford isn't a villain

Consider a node $n$. If $o(n) = \wedge$, we will say that $n$ is an and-node. Otherwise we will say that $n$
is an or-node. Similarly, consider an edge $e$. If $q(e) = \exists$, we will say that $e$ is an existential-edge.
Otherwise, $e$ is a universal-edge.
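
One possible in-memory representation of an abstract query is sketched below. The class and field names are hypothetical. Because every non-root node of a tree has exactly one incoming edge, the quantifier $q(e)$ is stored on the node that the edge enters; this is a representation choice of the sketch, not something prescribed by the definition.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Set


class Op(Enum):
    AND = "and"
    OR = "or"


class Quant(Enum):
    EXISTS = "exists"
    FORALL = "forall"


def always_true(s: str) -> bool:
    return True


@dataclass(eq=False)
class QueryNode:
    label: str                                    # l(n)
    content: Callable[[str], bool] = always_true  # c(n)
    op: Op = Op.AND                               # o(n)
    quant: Quant = Quant.EXISTS                   # q(e) for the edge entering n
    output: bool = False                          # True when n is in O
    children: List["QueryNode"] = field(default_factory=list)


def projected(root: QueryNode) -> Set[str]:
    """Collect the labels of the projected nodes (the set O)."""
    found = {root.label} if root.output else set()
    for child in root.children:
        found |= projected(child)
    return found


query = QueryNode("movieInfo", children=[
    QueryNode("movie", children=[
        QueryNode("descr", output=True),
        QueryNode("title", output=True),
        QueryNode("character", op=Op.OR, quant=Quant.FORALL),
    ]),
])
print(sorted(projected(query)))
```

Here the character node is an or-node entered by a universal-edge, while all other nodes are and-nodes entered by existential-edges.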

We give an intuitive explanation of the meaning of an abstract query. The formal semantics is
presented in Section 4. When evaluating a query, we will attempt to match nodes in a document
to nodes in the query. In order for a document node $n_X$ to match a query node $n_Q$, the function
$c(n_Q)$ should hold on the data below $n_X$. In addition, if $n_Q$ is an and-node (or-node), we require
that each (at least one) child of $n_Q$ be matched to a child of $n_X$. If $n_X$ is matched to $n_Q$
then a child $n'_X$ of $n_X$ can be matched to a child $n'_Q$ of $n_Q$, only if the edge $(n_Q, n'_Q)$ can be
satisfied w.r.t. $n_X$. Roughly speaking, in order for a universal-edge (existential-edge) to be
satisfied w.r.t. $n_X$, all children (at least one child) of $n_X$ that have the same label as $n'_Q$ must
be matched to $n'_Q$.

Note that in a concrete query the user can use the quantifiers $\{\exists, \forall, \neg\exists, \neg\forall\}$ and all nodes
are implicitly and-nodes. In an abstract query only the quantifiers $\{\exists, \forall\}$ may be used and the
nodes may be either and-nodes or or-nodes. When creating a user interface for our language we 
found that the concrete query language was generally more intuitive for the user. We present 
the abstract query language to simplify the discussion of the semantics and query evaluation. 
Note that the two languages are equivalent in their expressive power. 

We address the problem of translating a concrete query to an abstract query. Most of 
this process is straightforward. The tree structure of the abstract query is determined by the 
structure of the concrete query. The labeling function $l$ is determined by the labels (i.e., element
and attribute names) appearing in the concrete query. The set O is determined by the nodes 
marked for output by the user. 

Translating the quantification constraints is slightly more complicated. As a first step we
augment each edge in the query with the appropriate quantifier as determined by the user. We
associate each node with the $\wedge$ operator and with the content constraint specified by the user.
Note that an empty content constraint in a concrete query corresponds to the boolean function
$\top$. Next, we propagate the negation in the query. When negation is propagated through an
and-node (or-node), the node becomes an or-node (and-node), and the string matching function
associated with the node is replaced by its complement. Similarly, when negation is propagated
through an existential-edge (universal-edge), the edge becomes a universal-edge
(existential-edge). In this fashion, we derive a tree in which each edge is associated with $\exists$ or $\forall$ and each
node is associated with $\wedge$ or $\vee$. The functions $o$, $q$, and $c$ are determined by the process described
above.
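
The negation propagation just described can be sketched as a recursive translation. This is an illustrative reading of the procedure, not the authors' code: the classes `ConcreteNode` and `AbstractNode`, the string encoding of quantifiers, and the helper names are all assumptions, and the quantifier of an edge is stored on the node that the edge enters.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Matcher = Callable[[str], bool]


def always_true(s: str) -> bool:
    return True


def contains(phrase: str) -> Matcher:
    return lambda s: phrase.lower() in s.lower()


def complement(f: Matcher) -> Matcher:
    return lambda s: not f(s)


@dataclass
class ConcreteNode:
    label: str
    quant: str = "exists"          # exists, forall, not exists, or not forall
    content: Matcher = always_true
    children: List["ConcreteNode"] = field(default_factory=list)


@dataclass
class AbstractNode:
    label: str
    quant: str                     # "exists" or "forall", for the entering edge
    op: str                        # "and" or "or"
    content: Matcher
    children: List["AbstractNode"] = field(default_factory=list)


FLIP = {"exists": "forall", "forall": "exists"}


def translate(node: ConcreteNode, negated: bool = False) -> AbstractNode:
    """Push negation downwards; `negated` says whether we are under a negation."""
    base = node.quant.replace("not ", "")
    if node.quant.startswith("not "):
        negated = not negated      # a second negation cancels the first
    return AbstractNode(
        label=node.label,
        quant=FLIP[base] if negated else base,
        op="or" if negated else "and",
        content=complement(node.content) if negated else node.content,
        children=[translate(child, negated) for child in node.children],
    )


movie = ConcreteNode("movie", content=contains("Wild West"), children=[
    ConcreteNode("descr"),
    ConcreteNode("title"),
    ConcreteNode("character", quant="not exists", children=[
        ConcreteNode("role", content=contains("villain")),
        ConcreteNode("star", content=contains("Redford")),
    ]),
])
character = translate(movie).children[2]
print(character.quant, character.op, [c.quant for c in character.children])
```

For the movie fragment of the Redford query, the "does not have a character" edge becomes a universal-edge into an or-node, and the role and star conditions are replaced by their complements.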

The concrete query in Figure 3 is represented by the abstract query in Figure 4. The string
matching functions are specified in italics next to the corresponding nodes. Black nodes are 
output nodes. In the sequel, unless otherwise specified, the term query will refer to an abstract 
query. 




Figure 4: Abstract query for the concrete query in Figure 3. Output nodes are colored black.

Recall the search language requirements we presented in Section 1. We postulated that
in a search language, it should not be necessary for the user to specify the format of the result
(Criterion 1). In EquiX, by defining the set $O$, the user only specifies what information
she wants the result to include, and does not explicitly detail the format in which it should
appear. We suggested that it is important for there to be pattern matching, quantification,
and logical expressions for constraining data and meta-data (Criteria 2, 3, and 4). For data,
these can all be specified using the content function $c$. For meta-data, the pattern to which
the structure should be matched is specified by $T$ and $l$, the quantification is specified by $q$,
and logical operators can be specified using $o$. The result of an EquiX query is a set of XML
documents, and a DTD for the result documents can be computed (the procedure is given in an
appendix). Thus, requerying of results is possible in EquiX (Criterion 5). In Section 5 we show
that EquiX queries can be evaluated in polynomial time, and thus, EquiX meets Criterion 6.






4 Search Query Semantics 



When describing the semantics of a query in a relational database language, such as SQL or 
Datalog, the term matching can be used. The result of evaluating a query is the set of all tuples
that match the schemas mentioned in the query and satisfy the constraints. We describe the 
semantics of an EquiX query in a similar fashion. 

We first define when a node in a document matches a node in a query. Consider a document
$X$, and a query $Q$. Suppose that the labeling function of $X$ is $l_X$ and the labeling function of
$Q$ is $l_Q$. We say that a node $n_X$ in $X$ matches a node $n_Q$ in $Q$ if $l_X(n_X) = l_Q(n_Q)$. We denote
the parent of a node $n$ by $p(n)$. We now define a matching of a document to a query.

Definition 4.1 (Matching) Let $X = (T_X, l_X)$ be an XML document, with nodes $N_X$ and root
$r_X$. Let $Q = (T_Q, l_Q, c, o, q, O)$ be a query tree with nodes $N_Q$ and root $r_Q$. A matching of $X$ to
$Q$ is a function $\mu : N_Q \to 2^{N_X}$, such that the following hold

1. Roots Match: $\mu(r_Q) = \{r_X\}$;

2. Node Matching: if $n_X \in \mu(n_Q)$, then $n_X$ matches $n_Q$;

3. Connectivity: if $n_X \in \mu(n_Q)$ and $n_X$ is not the root of $X$, then $p(n_X) \in \mu(p(n_Q))$.

Note that Condition 1 requires that the root of the document is matched to the root of the
query, Condition 2 ensures that matching nodes have the same label, and Condition 3 requires
matchings to have a tree-like structure.

We define when a matching of a document to a query is satisfying. We first present some
auxiliary definitions. Consider an XML document $X = (T_X, l_X)$, where $T_X = (N_X, E_X, r_X)$.
Consider a node $n_X$ in $T_X$. We differentiate between the textual content (i.e., data) contained
below the node $n_X$, and the structural content (i.e., meta-data). When defining the textual
content of a node, we take ID and IDREF values into consideration. We say that $n'_X$ is a child
of $n_X$ if $(n_X, n'_X) \in E_X$. We say that $n'_X$ is an indirect child of $n_X$ if $n_X$ is an attribute of type
IDREF with the same value as an attribute of type ID of $n'_X$. We denote the textual content of
a node $n_X$ as $t(n_X)$, defined as follows:

• If $n_X$ is an atomic node, then $t(n_X) = l_X(n_X)$;

• Otherwise, $t(n_X)$ is a concatenation² of the content of its children and indirect children.

We demonstrate the textual content of a node with an example. Recall the XML document
depicted in Figure 1. The textual content of Node 9 is "villain 436 Jack Robinson". Note that
$t(24)$ includes the value "Jack Robinson" since Node 5 is an indirect child of Node 24.
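
A possible way to compute the textual content is sketched below. The model is deliberately simplified and its names are hypothetical: an IDREF attribute node carries an `idref` field, the referenced element is looked up in a dictionary `by_id`, and ID values are used only to resolve references (they contribute no text of their own). Each node is visited at most once, so cyclic references terminate.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass(eq=False)
class Node:
    label: str
    atomic: bool = False
    children: List["Node"] = field(default_factory=list)
    idref: Optional[str] = None    # set on attribute nodes of type IDREF


def textual_content(node: Node, by_id: Dict[str, Node],
                    seen: Optional[Set[int]] = None) -> str:
    """Concatenate the data below `node`, following IDREF links once each."""
    seen = set() if seen is None else seen
    if id(node) in seen:
        return ""                  # already taken into account
    seen.add(id(node))
    if node.atomic:
        return node.label
    parts = [textual_content(child, by_id, seen) for child in node.children]
    if node.idref is not None and node.idref in by_id:
        # The referenced element is an indirect child of the IDREF attribute.
        parts.append(textual_content(by_id[node.idref], by_id, seen))
    return " ".join(p for p in parts if p)


actor = Node("actor", children=[
    Node("name", children=[Node("Jack Robinson", atomic=True)]),
])
character = Node("character", children=[
    Node("role", children=[Node("villain", atomic=True)]),
    Node("star", idref="436", children=[Node("436", atomic=True)]),
])
by_id = {"436": actor}

print(textual_content(character, by_id))
```

Under these assumptions the character node yields the role value, the reference value, and the name of the referenced actor, in that order.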

We discuss when a quantification constraint is satisfied. Consider a document $X$, a query $Q$
and a matching $\mu$ of $X$ to $Q$. Let $n_X$ be a node in $X$ and let $e = (n_Q, n'_Q)$ be an edge in $Q$. We
say $n_X$ satisfies $e$ with respect to $\mu$ if the following holds

• If $e$ is an existential-edge then there is a child $n'_X$ of $n_X$ such that $n'_X$ matches $n'_Q$ and
$n'_X \in \mu(n'_Q)$;

• If $e$ is a universal-edge then for all children $n'_X$ of $n_X$, if $n'_X$ matches $n'_Q$, then $n'_X \in \mu(n'_Q)$.

²Note that an XML document may be cyclic as a result of ID and IDREF attributes. We take a finite
concatenation by taking each child into account only once. In addition, the order in which the concatenation is
taken and the ability to differentiate between data that originated in different nodes may affect the satisfiability
of a string matching function. This is a technical problem that is taken into consideration in the implementation.
We will not elaborate on this point any further.
We define a satisfying matching of a document to a query.



Definition 4.2 (Satisfying Matching) Let $X = (T_X, l_X)$ be an XML document, and let $Q =
(T_Q, l_Q, c, o, q, O)$ be a query tree. Let $\mu$ be a matching of $X$ to $Q$. We say that $\mu$ is a satisfying
matching of $X$ to $Q$ if for all nodes $n_Q$ in $Q$ and for all nodes $n_X \in \mu(n_Q)$ the following conditions
hold

1. if $n_Q$ is a leaf then $c(n_Q)(t(n_X)) = \top$, i.e., $n_X$ satisfies the string matching condition of
$n_Q$;

2. otherwise ($n_Q$ is not a leaf):

(a) if $n_Q$ is an or-node then $n_X$ satisfies either $c(n_Q)$ or at least one edge incident from
$n_Q$ with respect to $\mu$;

(b) if $n_Q$ is an and-node then $n_X$ satisfies both $c(n_Q)$ and all edges that are incident from
$n_Q$ with respect to $\mu$.



Condition 1 implies that the leaves satisfy the content constraints in $Q$. Conditions 2a and 2b
imply that X satisfies the quantification constraints in Q. The structural constraints are satisfied 
by the existence of a matching. 

Example 4.3 Recall the query in Figure 4 and the document in Figure 1. Two satisfying
matchings of the document to the query are specified in the following table. There are additional
matchings not shown here.

    Query Node    $\mu_1$      $\mu_2$
    movieInfo     {0}          {0}
    movie         {2}          {3}
    descr         {10}         {14}
    title         {11}         {15}
    character     {12,13}      {16}
    role          {25,27}      {29}
    star          {26,28}      {30}
    actor         {4}          {5}


Note that there is no satisfying matching that matches Node 1 to the movie node in the query 
because the universal quantification on the edge connecting movie and character cannot be 
satisfied. 



We presented several matchings of a document to a query. Let $\mu$ and $\mu'$ be matchings of a
document $X$ to a query $Q$. We define the union of $\mu$ and $\mu'$ in the obvious way. Formally, given
a query node $n_Q$,

$$(\mu \cup \mu')(n_Q) := \mu(n_Q) \cup \mu'(n_Q)$$
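
The pointwise union can be sketched in a few lines. As a simplification, a matching is represented here as a dictionary from query node labels to sets of document node numbers (labels stand in for query nodes, which works only because the labels in this query are distinct); the values are three rows taken from the table in Example 4.3.

```python
from typing import Dict, Set

Matching = Dict[str, Set[int]]


def union(mu1: Matching, mu2: Matching) -> Matching:
    """Pointwise union: each query node maps to the union of its two images."""
    return {q: mu1.get(q, set()) | mu2.get(q, set())
            for q in mu1.keys() | mu2.keys()}


mu1 = {"movieInfo": {0}, "movie": {2}, "character": {12, 13}}
mu2 = {"movieInfo": {0}, "movie": {3}, "character": {16}}

merged = union(mu1, mu2)
print(sorted(merged["movie"]), sorted(merged["character"]))
```

Since the union is taken node by node, a single function can summarize many matchings at once, which is the idea behind the proposition that follows.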

There may be an exponential number of matchings of a given document to a given query. 
Note, however, that the following proposition holds. 

Proposition 4.4 (Union of Matchings) Let $X$ be an XML document and let $Q$ be a query.
Let $\mathcal{M}$ be the set of all matchings of $X$ to $Q$. Then the union of all the matchings in $\mathcal{M}$ is a
matching. Formally, $\bigcup_{\mu \in \mathcal{M}} \mu$ is a matching of $X$ to $Q$.

We say that a document $X$ satisfies a query $Q$ if there exists a satisfying matching of $X$ to
$Q$. We now specify the output of evaluating a query on a single XML document. The result of a
query is the set of documents derived by evaluating the query on each document in the queried
catalog (i.e., each document that matched the DTD from which the query was derived).

Intuitively, the result of evaluating a query on a document is a subtree of the document
(as required in Criterion 1). The subtree contains nodes of three types. Document nodes
corresponding to output query nodes appear in the resulting subtree. In addition, we include
ancestors and descendents of these nodes. The ancestors ensure that the result has a tree-like
structure and that it is a projection of the original document. Recall that the textual content
of the document is contained in the atomic nodes of the document tree. Hence, the result must
include the descendents to ensure that the textual content is returned.

For a given document, query processing can be viewed as the process of singling out the
nodes of the document tree that will be part of the output. Consider a document $X = (T_X, l_X)$
with $T_X = (N_X, E_X, r_X)$ and a query $Q$ with projected nodes $O$. Let $\mathcal{M}$ be the set of satisfying
matchings of $X$ to $Q$. The output of evaluating the query $Q$ on the document $X$ is the
document defined by projecting $N_X$ on the set $N_R := X_{out} \cup X_{anc} \cup X_{desc}$, defined as

• $X_{out} := \{n_X \in N_X \mid (\exists n_O \in O)(\exists \mu \in \mathcal{M})\; n_X \in \mu(n_O)\}$, i.e., nodes in $X$ corresponding
to projected nodes in $Q$;

• $X_{anc} := \{n_X \in N_X \mid (\exists n'_X \in X_{out})\; n_X \text{ is an ancestor of } n'_X\}$, i.e., ancestors of nodes in
$X_{out}$;

• $X_{desc} := \{n_X \in N_X \mid (\exists n'_X \in X_{out})\; n_X \text{ is a descendent of } n'_X\}$, i.e., descendents of
nodes in $X_{out}$.

We call $N_R$ the output set of $X$ with respect to $Q$.
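
The three sets can be computed directly from their definitions, as the following sketch shows. The representation is an assumption made for brevity: the document tree is a dictionary mapping each node number to its parent (the root maps to `None`), matchings map query node labels to sets of node numbers, and the toy node numbers (including 100 and 101 for the atomic text nodes) are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set


def output_set(parent: Dict[int, Optional[int]],
               matchings: List[Dict[str, Set[int]]],
               projected: Set[str]) -> Set[int]:
    # X_out: document nodes matched to a projected query node by some matching
    x_out = {n for mu in matchings for q in projected for n in mu.get(q, set())}

    # X_anc: walk parent links upwards from every node in X_out
    x_anc: Set[int] = set()
    for n in x_out:
        p = parent[n]
        while p is not None:
            x_anc.add(p)
            p = parent[p]

    # X_desc: walk child links downwards from every node in X_out
    children: Dict[int, List[int]] = defaultdict(list)
    for n, p in parent.items():
        if p is not None:
            children[p].append(n)
    x_desc: Set[int] = set()
    stack = [c for n in x_out for c in children[n]]
    while stack:
        n = stack.pop()
        if n not in x_desc:
            x_desc.add(n)
            stack.extend(children[n])

    return x_out | x_anc | x_desc


# Toy tree: 0 root, 2 movie, 10 descr, 11 title, 12 character, 100/101 text nodes
parent = {0: None, 2: 0, 10: 2, 11: 2, 12: 2, 100: 10, 101: 11}
matchings = [{"movie": {2}, "descr": {10}, "title": {11}, "character": {12}}]

print(sorted(output_set(parent, matchings, {"descr", "title"})))
```

The character node is matched but neither projected nor related to a projected node, so it is left out of the output set.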

The result of applying the query in Figure 4 to the document in Figure 1 is depicted in
Figure 5. Note that the values of "descr" and "title" are grouped by "movie". This follows
naturally from the structure of the original document.



    movieInfo
      movie
        descr: "takes place in the Wild West..."
        title: The Lone Cowboy
      movie
        descr: "This movie..."
        title: Secrets of the Wild West

Figure 5: Result Document



5 Query Evaluation 

Recall that a query is defined by choosing a catalog and exploring its DTD. Consider a query 
$Q$ generated from a DTD $d$ in the catalog $(d, S)$. The result of evaluating $Q$ on the database is
the set of documents generated by evaluating Q on each document in S. 

We present an algorithm for evaluating a query on a document. There may be an exponential
number of matchings of a query to a document. Concrete queries contain both quantification
and negation. This would appear to be another source of complexity. Thus, it would seem that
computing the output of a query on a document should be computationally expensive. Roughly
speaking, however, query evaluation in this case is analogous to evaluating a first-order query
that can be written using only two variables. Therefore, using dynamic programming [CLR90]
we can in fact derive an algorithm that runs in polynomial time, even when the query is
considered part of the input (i.e., combined complexity). Thus, EquiX meets the search language
requirement of having polynomial evaluation time (Criterion 6).

In Figure 6 we present a polynomial procedure that computes the output of a document,
given a query. Given a document $X$ and query $Q$, the procedure Query_Evaluate computes the
output set $N_R$ of $X$ w.r.t. $Q$. Note that the value of $t(n_X)$ for each document node $n_X$ can be
computed in a preprocessing step in polynomial time. Query_Evaluate uses the procedure Matches
shown in Figure 7. Given a query node $n_Q$ and a document node $n_X$, the procedure Matches
checks if it is possible that $n_X \in \mu(n_Q)$ for some matching $\mu$, based on the subtrees of $n_Q$ and
$n_X$.



Algorithm Query_Evaluate

Input    A document $X = (T_X, l_X)$ s.t. $T_X = (N_X, E_X, r_X)$,
         A query tree $Q = (T_Q, l_Q, c, o, q, O)$ s.t. $T_Q = (N_Q, E_Q, r_Q)$.
Output   $N_R \subseteq N_X$, i.e., the outputted document nodes

Initialize match_array[][] to false
Queue_1 := $N_Q$, ordered by descending depth
While (not isEmpty(Queue_1)) do
    $n_Q$ := Dequeue(Queue_1)
    For all $n_X \in N_X$ such that path($n_X$) = path($n_Q$) do
        match_array[$n_Q$, $n_X$] := Matches($n_Q$, $n_X$, match_array)

$N_R := \emptyset$
Queue_2 := $N_Q$, ordered by ascending depth
While (not isEmpty(Queue_2)) do
    $n_Q$ := Dequeue(Queue_2)
    For all $n_X \in N_X$ do
        If ($n_Q \neq r_Q$ and not match_array[$p(n_Q)$, $p(n_X)$]) then
            match_array[$n_Q$, $n_X$] := false
        Else If (match_array[$n_Q$, $n_X$] and $n_Q \in O$) then
            $N_R := N_R \cup \{n_X\} \cup anc(n_X) \cup desc(n_X)$

Return $N_R$



Figure 6: Evaluation of an EquiX Query 

Note that $path(n)$ is the sequence of element names on the path from the root of the query
to $n$, and $anc(n)$ ($desc(n)$) is the set of ancestors (descendents) of $n$. Note also that match_array
is an array of size $|N_Q| \times |N_X|$ of boolean values, where $N_Q$ are the nodes in the query and $N_X$ are
the nodes in the document. Observe that in Figure 6 we order the nodes by descending depth.
This ensures that when Matches($n_Q$, $n_X$, match_array) is called, the array match_array is already
updated for all the children of $n_Q$ and $n_X$. The procedure Query_Evaluate does not explicitly
create any matchings. However, the following theorem holds.

Procedure Matches($n_Q$, $n_X$, match_array)

Input    A query node $n_Q$
         A document node $n_X$
         An array match_array
Output   true if $n_X$ may be in $\mu(n_Q)$ for a matching $\mu$,
         based on the subtrees of $n_X$ and $n_Q$, and false otherwise

$t_c := c(n_Q)(t(n_X))$
If $n_Q$ is a terminal node return $t_c$
Let $M_Q$ be the set of children of $n_Q$ in $Q$
For each $m_Q \in M_Q$ do:
    Let $M_X$ be the set of children, $m_X$, of $n_X$ in $X$ such that $l_X(m_X) = l_Q(m_Q)$
    If $(n_Q, m_Q)$ is an existential-edge then
        $status(m_Q) := \bigvee_{m_X \in M_X}$ match_array[$m_Q$, $m_X$]
    Else $status(m_Q) := \bigwedge_{m_X \in M_X}$ match_array[$m_Q$, $m_X$]
If $n_Q$ is an or-node then
    return $t_c \vee (\bigvee_{m_Q \in M_Q} status(m_Q))$
Else return $t_c \wedge (\bigwedge_{m_Q \in M_Q} status(m_Q))$



Figure 7: Satisfaction of a Node Procedure 
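
The two procedures of Figures 6 and 7 can be sketched together in Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the dictionary used in place of match_array, and the sample document and query are all hypothetical. Constraints are modelled as Python predicates on the node text, and a node without an explicit constraint is assumed to accept any text.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(eq=False)  # eq=False keeps nodes hashable by identity
class Node:
    label: str
    text: str = ""                                      # t(n_X), document nodes only
    constraint: Callable[[str], bool] = lambda s: True  # c(n_Q), query nodes only
    op: str = "and"                                     # o(n_Q): "and" or "or"
    universal: bool = False                             # quantifier of the edge entering n_Q
    projected: bool = False                             # True if n_Q is in O
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def walk(root):
    """Yield root and all of its descendents."""
    yield root
    for child in root.children:
        yield from walk(child)


def path(node):
    """Sequence of labels from the root down to node."""
    labels = []
    while node is not None:
        labels.append(node.label)
        node = node.parent
    return tuple(reversed(labels))


def matches(nq, nx, match):
    """Procedure Matches: decide the pair (nq, nx) from its children's entries."""
    tc = nq.constraint(nx.text)
    if not nq.children:
        return tc
    status = []
    for mq in nq.children:
        vals = [match.get((mq, mx), False) for mx in nx.children if mx.label == mq.label]
        status.append(all(vals) if mq.universal else any(vals))
    return (tc or any(status)) if nq.op == "or" else (tc and all(status))


def query_evaluate(doc_root, query_root):
    qnodes, xnodes = list(walk(query_root)), list(walk(doc_root))
    match = {}
    # First pass: bottom-up, deepest query nodes first.
    for nq in sorted(qnodes, key=lambda n: len(path(n)), reverse=True):
        for nx in xnodes:
            if path(nx) == path(nq):
                match[(nq, nx)] = matches(nq, nx, match)
    # Second pass: top-down, enforce connectivity and collect the output set.
    result = set()
    for nq in sorted(qnodes, key=lambda n: len(path(n))):
        for nx in xnodes:
            if nq is not query_root and not match.get((nq.parent, nx.parent), False):
                match[(nq, nx)] = False
            elif match.get((nq, nx), False) and nq.projected:
                result.update(walk(nx))          # nx and its descendents
                anc = nx.parent
                while anc is not None:           # ancestors of nx
                    result.add(anc)
                    anc = anc.parent
    return result


# Hypothetical document: two movies, one of them set in the Wild West.
doc = Node("movieInfo")
m1 = doc.add(Node("movie"))
m1.add(Node("title", "The Lone Cowboy"))
m1.add(Node("descr", "takes place in the Wild West"))
m2 = doc.add(Node("movie"))
m2.add(Node("title", "City Story"))
m2.add(Node("descr", "a modern drama"))

# Hypothetical query: titles of movies whose description mentions the Wild West.
query = Node("movieInfo")
qm = query.add(Node("movie"))
qm.add(Node("title", projected=True))
qm.add(Node("descr", constraint=lambda s: "Wild West" in s))

hits = query_evaluate(doc, query)
print(sorted((n.label, n.text) for n in hits))
```

The sketch recomputes paths on every comparison for brevity, so it does not aim at the running time stated in Theorem 5.2; caching the paths in a preprocessing step would be the natural refinement.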

Theorem 5.1 (Correctness of Query_Evaluate) Given document X and a query Q, the algo- 
rithm Query_Evaluate retrieves the output set of X w.r.t. Q. 

In Appendix B we prove this theorem. The procedure Query_Evaluate runs in polynomial
time in combined complexity. Let $|D|$ be the size of the data in document $X$, i.e., the size of
$X$ when ignoring $X$'s meta-data. Let $C(m)$ be an upper-bound on the runtime of computing a
string-matching constraint on a string of size $m$.

Theorem 5.2 (Polynomial Complexity) Given document $X$ and a query $Q$, the algorithm
Query_Evaluate runs in time $O(|N_X| \cdot |N_Q| \cdot (|N_Q| \cdot |N_X| + C(|D|)))$.

Note that query evaluation generates a set of documents. Recall that a query is formulated
by exploring a DTD and only documents in the catalog of the chosen DTD will be queried. Thus,
in order to allow iterative querying or requerying of results, a DTD for the resulting documents
must be defined. A result DTD is a DTD to which the resulting documents strictly conform. In
Appendix A we present a polynomial procedure that computes a result DTD for a given query.
The result DTD is linear in the size of the DTD from which the query originated. The
compactness of the result DTD makes the requerying process simpler, since requerying entails
exploring the result DTD. Thus, EquiX fulfills the search language requirement of ability to
perform requerying (Criterion 5).

6 Extending EquiX Queries 

EquiX can be extended in many ways to yield a more powerful language. In this section we
present two extensions to the EquiX language. These extensions add additional querying ability
to EquiX. After extending EquiX, the search language requirements 1 through ^ are still met.
However, it is a matter of opinion whether EquiX still fulfills Criterion ^ requiring simplicity of use.
Thus, these extensions are perhaps more suitable for expert users.



6.1 Adding Aggregation Functions and Constraints 

We extend the EquiX language to allow computing of aggregation functions and verification of
aggregation constraints. We call the new language EquiX$^{agg}$.

We extend the abstract query formalism to allow aggregation. An aggregation function is
one of {count, min, max, sum, avg}. An atomic aggregation constraint has the form $f \, \theta \, v$ where
$f$ is an aggregation function, $\theta \in \{<, \leq, =, \neq, \geq, >\}$, and $v$ is a constant value. An aggregation
constraint is a (possibly empty) conjunction of atomic aggregation constraints. In EquiX$^{agg}$ a
query is a tuple $(T, l, c, o, q, a, a_c, O)$ as in EquiX, augmented with the following functions:

• $a$ is an aggregation specifying function that associates each node with a (possibly empty)
set of aggregation functions;

• $a_c$ is an aggregation constraint function that associates each node with an aggregation
constraint.

Given a node $n_Q$, the aggregation specifying functions $a(n_Q)$ (and similarly the aggregation
constraint functions) are applied to $t(n_Q)$. Note that the min and max functions can only be
applied to an argument whose domain is ordered. Similarly, the aggregation functions sum and
avg can only be applied to sets of numbers. There is no way to enforce such typing using a
DTD (although it is possible using an XML Schema [ Con| ]). When a function is applied to an
argument that does not meet its requirement, its result is undefined.

The function $a$ adds the computed aggregation values to the output. This is similar to
placing an aggregation function in the SELECT clause of an SQL query. The function $a_c$ is used
to further constrain a query. This is similar to the HAVING clause of an SQL query.
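
As an illustration, an aggregation constraint can be checked against a collection of values as in the following Python sketch. The encoding of an atomic constraint as a triple `(f, theta, v)`, the ASCII spellings of the comparison operators and the function name `holds` are assumptions made for this example only.

```python
import operator
from typing import Callable, Dict, List, Sequence, Tuple

# The five aggregation functions named in the text.
AGGREGATES: Dict[str, Callable[[Sequence[float]], float]] = {
    "count": len,
    "min": min,
    "max": max,
    "sum": sum,
    "avg": lambda xs: sum(xs) / len(xs),
}

# ASCII spellings of the six comparison operators (an assumption of this sketch).
COMPARISONS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    "!=": operator.ne, ">=": operator.ge, ">": operator.gt,
}

AtomicConstraint = Tuple[str, str, float]  # (f, theta, v)


def holds(constraint: List[AtomicConstraint], values: Sequence[float]) -> bool:
    """Check a conjunction of atomic constraints; the empty conjunction holds."""
    return all(
        COMPARISONS[theta](AGGREGATES[f](values), v)
        for f, theta, v in constraint
    )


print(holds([("count", ">=", 2), ("avg", "<", 10)], [3, 5]))
print(holds([("max", "<", 5)], [3, 5]))
```

In line with the text, applying min, max, sum or avg to values that are not ordered or not numeric is left undefined here; the sketch simply lets Python raise an error in that case.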

In order to use an aggregation function in an SQL query, one must include a GROUP-BY
clause. This clause specifies on which variables the grouping should be performed when
computing the result. To simplify the querying process, we do not require the user to specify
the GROUP-BY variables. EquiX$^{agg}$ uses a simple heuristic rule to determine the grouping
variables. Suppose that $a(n_Q) \neq \emptyset$ for some node $n_Q$. Let $n_O$ be the lowest node above $n_Q$ in
$Q$ for which one of the following conditions holds:

• $n_O \in O$ or

• $n_O$ is an ancestor of a node in $O$.

Note that $n_O$ is the lowest node above $n_Q$ where both textual content and aggregate values may
be combined. EquiX$^{agg}$ groups by $n_O$ when computing the aggregation functions on $n_Q$. In a
similar fashion, EquiX$^{agg}$ performs grouping in order to compute aggregation constraints.
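
The heuristic for choosing the grouping node can be sketched as follows. The sketch assumes that "above" means strictly above the aggregated node and represents the query tree by a child-to-parent map; the function name `grouping_node` and the sample query are hypothetical.

```python
from typing import Dict, Optional, Set


def grouping_node(parent: Dict[str, Optional[str]],
                  projected: Set[str],
                  nq: str) -> Optional[str]:
    """Return the lowest node strictly above nq that is in O or is an ancestor of a node in O."""
    # Mark every projected node together with all of its ancestors.
    marked: Set[str] = set()
    for node in projected:
        cur: Optional[str] = node
        while cur is not None:
            marked.add(cur)
            cur = parent[cur]
    # Climb from the parent of nq until a marked node is found.
    cur = parent[nq]
    while cur is not None and cur not in marked:
        cur = parent[cur]
    return cur


# Hypothetical query tree: movieInfo / movie / {title, actor / salary}.
query_parent = {
    "movieInfo": None, "movie": "movieInfo",
    "title": "movie", "actor": "movie", "salary": "actor",
}
# Aggregating on salary while projecting title groups by movie.
print(grouping_node(query_parent, {"title"}, "salary"))
```

If `actor` were projected instead of `title`, the same call would return `actor`, since it is then the lowest qualifying node above `salary`.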

Our choice for grouping variables is natural since it takes advantage of the tree structure
of the query, and thus, suggests a polynomial evaluation algorithm. It is easy to see that
adding aggregation functions and constraints does not affect the polynomiality of the evaluation
algorithm. The algorithm for creating a result DTD must also be slightly adapted in order to
take into consideration the aggregation values that are retrieved. Thus, EquiX$^{agg}$ meets the
search language requirements || through ^.

6.2 Querying Ontologies using Regular Expressions

In EquiX, the user chooses a catalog and queries only documents in the chosen catalog. It
is possible that the user would like to query documents conforming to different DTDs, but
containing information about the same subject. In EquiX$^{reg}$ this ability is given to the user.
Thus, EquiX$^{reg}$ is useful for information integration.

An ontology, denoted $\mathcal{O}$, is a set of terms whose meanings are well known. Note that an
ontology can be implemented using XML Namespaces [BHL99]. We say that a document $X$
can be described by $\mathcal{O}$ if some of the element and attribute names in $X$ appear in $\mathcal{O}$. When
formulating a query, the user chooses an ontology of terms that describes the subject matter
that she is interested in querying. Documents that can be described by the chosen ontology
will be queried. A query tree in EquiX$^{reg}$ is a tuple $(T, l, c, o, q, O)$ as in EquiX. However, in
EquiX$^{reg}$, $l$ is a function from the set of nodes to $\mathcal{O}$.

Semantically, an EquiX$^{reg}$ query is interpreted in a different fashion from an EquiX query.
Each edge is implicitly labeled with the "+" symbol. Intuitively, an edge in a query corresponds
to a sequence of one or more edges in a document. We adapt the definition of satisfaction of an
edge (presented in Section ^) to reflect this change.

Consider a document $X$, a query $Q$ and a matching $\mu$ of $X$ to $Q$. Let $n_X$ be a node in $X$
and let $e = (n_Q, n'_Q)$ be an edge in $Q$. We say $n_X$ satisfies $e$ with respect to $\mu$ if the following
holds:

• If $e$ is an existential-edge then there is a descendent $n'_X$ of $n_X$ such that $n'_X$ matches $n'_Q$
and $n'_X \in \mu(n'_Q)$.

• If $e$ is a universal-edge then for all descendents $n'_X$ of $n_X$, if $n'_X$ matches $n'_Q$, then
$n'_X \in \mu(n'_Q)$.

Note that the only change in the definition was to replace the words child and children with 
descendent and descendents. 
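
The descendent-based notion of edge satisfaction can be sketched in Python as follows. The sketch assumes that a document node "matches" a query node when their labels agree, and it takes the set $\mu(n'_Q)$ as given. The names `descendents`, `satisfies_edge` and the sample tree are hypothetical.

```python
from typing import Dict, Iterator, List, Set


def descendents(node: str, children: Dict[str, List[str]]) -> Iterator[str]:
    """Yield every proper descendent of node."""
    stack = list(children.get(node, []))
    while stack:
        cur = stack.pop()
        yield cur
        stack.extend(children.get(cur, []))


def satisfies_edge(nx: str,
                   children: Dict[str, List[str]],
                   label_of: Dict[str, str],
                   target_label: str,
                   mu_target: Set[str],
                   universal: bool) -> bool:
    """Check whether nx satisfies an edge whose lower end has label target_label."""
    # Descendents of nx that match the lower query node (label equality assumed).
    candidates = [d for d in descendents(nx, children) if label_of[d] == target_label]
    if universal:
        return all(d in mu_target for d in candidates)
    return any(d in mu_target for d in candidates)


# Hypothetical document: the actors are grandchildren of the movie node.
children = {"movie": ["cast"], "cast": ["actor1", "actor2"]}
label_of = {"movie": "movie", "cast": "cast", "actor1": "actor", "actor2": "actor"}
mu_actor = {"actor1"}

print(satisfies_edge("movie", children, label_of, "actor", mu_actor, universal=False))
print(satisfies_edge("movie", children, label_of, "actor", mu_actor, universal=True))
```

Under the original child-based semantics neither actor would be considered, because both sit two edges below the movie node.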

In a straightforward fashion, we can modify the query evaluation algorithm to reflect the
new semantics presented. The new algorithm remains polynomial under combined complexity.
Note that it is no longer possible to create a result DTD if we do not permit a result DTD to
contain content definitions of type ANY. This results from the possible diversity of the documents
being queried. However, EquiX$^{reg}$ still meets Criterion 5 (i.e., ability to requery results), since
the resulting documents can be described by the chosen ontology. Thus, EquiX$^{reg}$ meets the
search language requirements ^ through ^.

7 Conclusion 

In this paper we presented design criteria for a search language. We defined a specific search 
language for XML, namely EquiX, that fulfills these requirements. Both a user-friendly con- 
crete syntax and a formal abstract syntax were presented. We defined an evaluation algorithm 
for EquiX queries that is polynomial even under combined complexity. We also presented a 
polynomial algorithm that generates a DTD for the result documents of an EquiX query. 

We believe that EquiX enables the user to search for information in a repository of XML 
documents in a simple and intuitive fashion. Thus, our language is especially suitable for use 
in the context of the Internet. EquiX has the ability to express complex queries with negation, 
quantification and logical expressions. We have also extended EquiX to allow aggregation and 
limited regular expressions. 

Several XML query languages have been proposed recently, such as XML-QL [DFF+98],
XQL [RLS98] and Lorel [GMW99]. These languages are powerful in their querying ability.
However, they do not fulfill some of our search language requirements. In these languages, the
user can perform restructuring of the result. Thus, the format of the result must be specified,
in contradiction to Criterion |^. Furthermore, XML-QL and XQL are limited in their ability
to express quantification constraints (Criterion |^). Most importantly, none of these languages
guarantees polynomial evaluation under combined complexity (Criterion ^).

As future work, we plan to extend the ability of querying ontologies and to allow more 
complex regular expressions in EquiX. XML documents represent data that may not have a strict 
schema. In addition, search queries constitute a guess of the content of the desired documents. 
Thus, we plan to refine EquiX with the ability to deal with incomplete information [KNS99]
and with documents that "approximately satisfy" a query. Search engines perform an important 
service for the user by sorting the results according to their quality. We plan on experimenting to 
find a metric for ordering results that takes both the data and the meta-data into consideration. 

As the World-Wide Web grows, it is becoming increasingly difficult for users to find desired 
information. The addition of meta-data to the Web provides the ability to both search and query 
the Web. Enabling users to formulate powerful queries in a simple fashion is an interesting and 
challenging problem. 



A Creating a Result DTD 

In Section 5 we described the process of evaluating a query on a database. Query evaluation
generates a set of documents. A query is formed using a chosen DTD, called the originating
DTD, and only documents strictly conforming to the originating DTD will be queried. Thus,
in order to allow iterative querying or requerying of results, a DTD for the resulting documents
must be defined. Given a query $Q$, if any possible result document must conform to the DTD
$d_R$, we say that $d_R$ is a result DTD for $Q$. In this section we present a procedure that, given a
query $Q$, computes in polynomial time a result DTD for $Q$. Thus, we show that EquiX fulfills
the search language requirement of ability to perform requerying (Criterion 5).

A DTD is a set of element definitions and attribute list definitions. An element definition
has the form <!ELEMENT e $\varphi$>, where e is the element name being defined and $\varphi$ is its content
definition. An attribute list definition has the form <!ATTLIST e $\psi_1 \cdots \psi_n$>, where e is an
element name and $\psi_1 \cdots \psi_n$ are definitions of attributes for e. The set of element names defined
in a DTD $d$ is its element name set, denoted $\mathcal{E}_d$.

Consider a query $Q = (T, l, c, o, q, O)$ formulated from a DTD $d$. We say that element name
$e'$ is a descendent of element name $e$ in $d$ if $e'$ may be nested within an element $e$ in a document
conforming to $d$. Formally, $e'$ is a descendent of $e$ if

• $e'$ appears in the element definition of $e$ or

• $e'$ is a descendent of an element name $e''$ which appears in the content definition of $e$.

We say that $e$ is an ancestor of $e'$ in $d$ if $e'$ is a descendent of $e$ in $d$. Note that element name
$e$ may appear in a resulting document of evaluating $Q$ if there is a node $n_O \in O$ such that
$l(n_O) = e$. Additionally, $e$ may appear in the output if $e$ is an ancestor or descendent in $d$ of an
element $e'$ that meets the condition presented in the previous sentence. Thus, given a query, we
can compute in linear time the element name set of the result DTD $d_R$.
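
The computation of the element name set of the result DTD can be illustrated with a short Python sketch. It assumes that the DTD is given as a map from each element name to the set of element names occurring in its content definition, and it favours clarity over the linear-time bound mentioned above. The names `dtd_descendents`, `result_element_names` and the sample DTD are hypothetical.

```python
from typing import Dict, Set


def dtd_descendents(dtd: Dict[str, Set[str]], e: str) -> Set[str]:
    """All element names that may be nested (at any depth) within e."""
    seen: Set[str] = set()
    stack = list(dtd.get(e, ()))
    while stack:
        cur = stack.pop()
        if cur not in seen:          # guards against recursive DTDs
            seen.add(cur)
            stack.extend(dtd.get(cur, ()))
    return seen


def result_element_names(dtd: Dict[str, Set[str]], projected_labels: Set[str]) -> Set[str]:
    """Labels of projected nodes plus their DTD ancestors and DTD descendents."""
    result = set(projected_labels)
    for e in projected_labels:
        result |= dtd_descendents(dtd, e)
    for e in dtd:
        # e is an ancestor of a projected label if that label is among its descendents.
        if dtd_descendents(dtd, e) & projected_labels:
            result.add(e)
    return result


# Hypothetical originating DTD: element name -> names in its content definition.
dtd = {
    "movieInfo": {"movie"},
    "movie": {"title", "descr", "actor"},
    "actor": {"name"},
    "title": set(), "descr": set(), "name": set(),
}
print(sorted(result_element_names(dtd, {"actor"})))
```

With `actor` as the only projected label, `title` and `descr` drop out of the element name set, while `name`, `movie` and `movieInfo` remain.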

In order to compute the result DTD of a query $Q$, we must compute the content definitions
and attribute list definitions for the elements in $\mathcal{E}_{d_R}$. In the result DTD, we take the attribute
list definitions for the elements in $\mathcal{E}_{d_R}$ as defined in the originating DTD but change all attributes
to be of type #IMPLIED. Note that the root element name $r$ of the originating DTD will always
be in $\mathcal{E}_{d_R}$. It is easy to see that $r$ is also the designated root element name of $d_R$.

In Figure 8 we present an algorithm that computes the content definition for an element in
$\mathcal{E}_{d_R}$. Intuitively, any elements that will not appear in the result of a query must be removed
from the original DTD in order to form the result DTD. In addition, elements will only appear
in result documents if query constraints are satisfied. Thus, this possible appearance of elements
may be taken into account when formulating $d_R$. The algorithm Create_Content_Definition uses
the procedure presented in Figure 9 in order to simplify the content definition it creates. The
result DTD is created by computing the content definitions for all $e \in \mathcal{E}_{d_R}$ and adding the
attribute list definitions. Note that in the algorithm, $dtd\_desc(e', e)$ is true if $e$ is a descendent
of $e'$ in the DTD $d$. In addition, $anc(n_Q, O)$ is true if $n_Q$ is an ancestor of some node in $O$.



Algorithm Create_Content_Definition

Input    An element $e \in \mathcal{E}_{d_R}$
         A query $Q$ with nodes $N_Q$ and projected nodes $O$
         The originating DTD $d$ with content definition $\varphi_e$ for $e$
Output   The content definition of $e$ in the result DTD

If $(\exists n_O \in O)$ s.t. ($(l(n_O) = e)$ or
        ($l(n_O) = e'$ and $dtd\_desc(e', e)$))
    then $\varphi := \varphi_e$
Else $\varphi := \emptyset$
For all nodes $n_Q \in N_Q$ do
    If ($(l(n_Q) = e)$ and $(anc(n_Q, O))$) then
        For all elements $e'$ in $\varphi_e$ do
            If there is a child of $n_Q$, namely $n'_Q$ s.t.
                    ($(l(n'_Q) = e')$ and $(anc(n'_Q, O))$) then
                Replace all occurrences of $e'$ in $\varphi_{n_Q}$ with $(e'?)$
            Else Replace all occurrences of $e'$ in $\varphi_{n_Q}$ with $\emptyset$
        $\varphi := (\varphi \mid \varphi_{n_Q})$
Return Simplify($\varphi$).



Figure 8: Content Definition Generation Algorithm 



Theorem A.1 (Correctness of DTD Creation) Let $Q$ be a query with an originating DTD
$d$ and let $X$ be a document. Suppose that the result of evaluating $Q$ on $X$ is the XML document
$R$. Then $R$ strictly conforms to the result DTD as formed by the process described above. In
addition, the computation of the result DTD can be performed in time $O(|d| + |Q|)$.



Note that it follows from Theorem A.1 that the result DTD is linear in the size of the
original DTD. The compactness of the result DTD makes the requerying process simpler, since
requerying entails exploring the result DTD.

According to Theorem A.1, the resulting documents conform to the result DTD. The question
arises as to how precisely the result DTD describes the resulting documents. In order to answer
this question we define a partial order on DTDs ([PV99]). Given a DTD $d$ we denote the set of
XML documents that strictly conform to $d$ as $conf(d)$. Given DTDs $d$ and $d'$ we say that $d$ is

Procedure Simplify($\varphi$)

Input    Content definition $\varphi$
Output   Simplified content definition of $\varphi$

While there is a change in $\varphi$
    Apply the following rules to $\varphi$ or its subexpressions
        1. $(t, \emptyset) \to t$          5. $(\emptyset)? \to \emptyset$
        2. $(\emptyset, t) \to t$          6. $(\emptyset)* \to \emptyset$
        3. $(t \mid \emptyset) \to t?$     7. $(\emptyset)+ \to \emptyset$
        4. $(\emptyset \mid t) \to t?$
If $\varphi = \emptyset$ then Return EMPTY
Else Return $\varphi$



Figure 9: Definition Simplifying Algorithm 
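
The simplification procedure can be sketched in Python on a small expression tree. The encoding is an assumption of this sketch: element names are strings, `None` stands for the empty-set symbol, and composite definitions are tuples tagged `"seq"`, `"alt"`, `"opt"`, `"star"` or `"plus"`. It is also assumed that rules 5 to 7 rewrite to the empty-set symbol. A single bottom-up pass is used instead of the repeat-until-no-change loop, which yields the same result because every subexpression is simplified before its parent.

```python
from typing import Optional, Tuple, Union

# A content definition: an element name, None for the empty-set symbol,
# or a tagged tuple such as ("seq", a, b), ("alt", a, b), ("opt", a).
Expr = Union[str, None, Tuple]


def reduce_expr(phi: Expr) -> Expr:
    """Apply rules 1 to 7 bottom-up until no rule is applicable."""
    if phi is None or isinstance(phi, str):
        return phi
    op, *args = phi
    args = [reduce_expr(a) for a in args]
    if op == "seq":
        a, b = args
        if b is None:
            return a                     # rule 1: (t, 0) -> t
        if a is None:
            return b                     # rule 2: (0, t) -> t
        return ("seq", a, b)
    if op == "alt":
        a, b = args
        if a is None and b is None:
            return None                  # rule 3 followed by rule 5
        if b is None:
            return ("opt", a)            # rule 3: (t | 0) -> t?
        if a is None:
            return ("opt", b)            # rule 4: (0 | t) -> t?
        return ("alt", a, b)
    (a,) = args                          # "opt", "star" or "plus"
    if a is None:
        return None                      # rules 5 to 7
    return (op, a)


def simplify(phi: Expr) -> Expr:
    """Procedure Simplify: return the keyword EMPTY if nothing is left."""
    reduced = reduce_expr(phi)
    return "EMPTY" if reduced is None else reduced


# (title, (descr | 0)) simplifies to (title, descr?).
print(simplify(("seq", "title", ("alt", "descr", None))))
print(simplify(("star", ("seq", None, None))))
```

The keyword `"EMPTY"` is returned as a plain string here, so an element actually named EMPTY would be ambiguous in this toy encoding.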



tighter than $d'$, denoted $d \preceq d'$, if $conf(d) \subseteq conf(d')$. We say that $d$ is strictly tighter than $d'$,
denoted $d \prec d'$, if $conf(d) \subset conf(d')$.

Intuitively, it would be desirable to find a result DTD $d_R$ that is as tight as possible, under
the restriction that all possible result documents must strictly conform to $d_R$. However,
our algorithm does not necessarily find the tightest possible result DTD. In other words, our
algorithm may create a result DTD $d_R$ although there exists a DTD $d'_R$ to which all resulting
documents must strictly conform and $d'_R \prec d_R$. If $d_R$ is the tightest possible result DTD, we call
$d_R$ a minimal result DTD. According to the following Proposition, there is a query and DTD
for which a minimal result DTD must be exponential in the size of the original DTD.

Proposition A.2 (Exponential Result DTD) There is a query $Q$ created from an originating
DTD $d$, such that if $d'$ is a minimal result DTD of $Q$, then $d'$ is of size $O(|d|!)$.

Observe that an exponential blowup of the original DTD is undesirable for two reasons. 
First, creating such a DTD is intractable. Second, if the result DTD is of exponential size, then 
it is difficult for a user to requery previous results. Thus, our algorithm for creating result DTDs 
actually returns a convenient DTD, although it is not always minimal. 



B Correctness of Query_Evaluate

In this Section we prove the correctness of the algorithm Query_Evaluate presented in Figure 6.
We first prove some necessary lemmas.

Lemma B.1 Let $n_X$ be a node in a document $X$ and let $n_Q$ be a node in a query $Q$. If there
exists a matching $\mu$ of $X$ to $Q$ such that $n_X \in \mu(n_Q)$ then $path(n_X) = path(n_Q)$.

Proof. Suppose that $n_X \in \mu(n_Q)$. We show by induction on the depth of $n_Q$ that $path(n_X) =
path(n_Q)$. Suppose that the labeling function of $Q$ is $l_Q$ and that the labeling function of $X$ is
$l_X$.

Case 1: Suppose that $n_Q$ is the root of $Q$. According to Condition 2 in Definition 4.1, it holds
that $l_Q(n_Q) = l_X(n_X)$. According to Condition 1 in Definition 4.1, $n_X$ is the root of $X$. Thus,
clearly the claim holds.

Case 2: Suppose that $n_Q$ is of depth $m$. Once again, according to Condition 2 in Definition 4.1
it holds that $l_Q(n_Q) = l_X(n_X)$. In addition, $p(n_X) \in \mu(p(n_Q))$ (see Condition 3 in
Definition 4.1). Note that $p(n_Q)$ is of depth $m - 1$. Thus, by the induction hypothesis, $path(p(n_X))
= path(p(n_Q))$. Our claim follows. □

We present an auxiliary definition. We define the height of a query node $n_Q$, denoted $h(n_Q)$,
as

$$h(n_Q) = \begin{cases} 0 & \text{if } n_Q \text{ is a leaf node} \\ \max\{h(n'_Q) \mid n'_Q \text{ is a child of } n_Q\} + 1 & \text{otherwise} \end{cases}$$
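
The height of a query node, as just defined, translates directly into a recursive function. In this small Python sketch the query tree is given as a map from each node to the list of its children; the map, the node names and the function name `height` are hypothetical.

```python
from typing import Dict, List


def height(node: str, children: Dict[str, List[str]]) -> int:
    """Height of a query node: 0 for a leaf, else 1 + the maximum height of a child."""
    kids = children.get(node, [])
    if not kids:
        return 0
    return max(height(kid, children) for kid in kids) + 1


# Hypothetical query tree: movieInfo / movie / {title, actor / name}.
query_children = {
    "movieInfo": ["movie"],
    "movie": ["title", "actor"],
    "actor": ["name"],
}
print(height("movieInfo", query_children))
```

Induction on this quantity proceeds from the leaves upwards, which mirrors the order in which the first pass of Query_Evaluate fills match_array.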

We show that the algorithm implicitly defines a satisfying matching $\mu_R$. The nodes that are
returned are those corresponding to output nodes in $\mu_R$, and their ancestors and descendents.
We define the function $\mu_R : N_Q \to 2^{N_X}$ in the following way:

$n_X \in \mu_R(n_Q) \iff$ match_array[$n_Q$, $n_X$] = "true"

Note that we consider the values of match_array at the end of the evaluation of Query_Evaluate.
We call $\mu_R$ the retrieval function of Query_Evaluate w.r.t. $X$ and $Q$.

Lemma B.2 (Retrieval Function is a Satisfying Matching) Let $X$ be a document, let $Q$
be a query and let $\mu_R$ be the retrieval function of Query_Evaluate w.r.t. $X$ and $Q$. Then $\mu_R$ is a
satisfying matching of $X$ to $Q$.



Proof. We show that $\mu_R$ is a matching, i.e., that $\mu_R$ meets the conditions in Definition 4.1.

Roots Match: The only node that has the same path as $r_Q$ is $r_X$. Thus, the only
time that the function Matches is called for the root of the query is with the root of the
document. Thus, the value of $\mu_R(r_Q)$ must be either $\{r_X\}$ or $\emptyset$. However, it is
easy to see that if $\mu_R(r_Q) = \emptyset$ then Query_Evaluate returns $\emptyset$. Thus, it must hold that
$\mu_R(r_Q) = \{r_X\}$.

Node Matching: If $n_X \in \mu_R(n_Q)$ then Matches was called with $n_Q$ and $n_X$. Thus $n_X$
and $n_Q$ have the same path, and hence, $n_X$ matches $n_Q$.

Connectivity: If match_array[$p(n_Q)$, $p(n_X)$] does not hold, then match_array[$n_Q$, $n_X$] is
assigned the value "false". Therefore, clearly the connectivity requirement of a matching
holds.



We now show that $\mu_R$ is a satisfying matching (see Definition 4.2). Suppose that $n_X \in
\mu_R(n_Q)$ for a document node $n_X$ and a query node $n_Q$. Note that this implies that Matches
returned the value "true" when applied to $n_Q$ and $n_X$. We prove by induction on the height of
$n_Q$ that the appropriate condition holds. We consider three cases.

• Suppose that $h(n_Q) = 0$. Then $n_Q$ is a leaf and $c(n_Q)(t(n_X))$ must hold as required.

• Suppose that $h(n_Q) > 0$ and that $n_Q$ is an or-node. The procedure Matches returned the
value "true" when applied to $n_Q$ and $n_X$. Therefore, one of the following must hold:

1. The condition $c(n_Q)(t(n_X)) = \top$ holds.

2. At the time of application of Matches to $n_Q$ and $n_X$, there was a child $m_Q$ of $n_Q$
and a child $m_X$ of $n_X$ such that match_array[$m_Q$, $m_X$] had the value "true". Note
that it follows that $m_X$ matches $m_Q$. The value of match_array[$m_Q$, $m_X$] will not be
changed to "false" during the evaluation of Query_Evaluate since match_array[$n_Q$, $n_X$]
is "true". Thus, $m_X \in \mu_R(m_Q)$.

In either case Condition 2a from Definition 4.2 holds as required.

• Suppose that $h(n_Q) > 0$ and that $n_Q$ is an and-node. We omit the proof as it is similar to
the previous case.

Thus, $\mu_R$ is a satisfying matching as required. □

We say that a matching $\mu$ contains a matching $\mu'$ if there exists a matching $\mu''$ such that
$\mu = \mu' \cup \mu''$. We will show that the retrieval function contains all other satisfying matchings.

Lemma B.3 (Retrieval Function Containment) Let $X$ be a document, let $Q$ be a query
and let $\mu_R$ be the retrieval function of Query_Evaluate w.r.t. $X$ and $Q$. Suppose that $\mu$ is a
satisfying matching of $X$ to $Q$. Then $\mu_R$ contains $\mu$.

Proof. Suppose that $\mu$ is a satisfying matching. We show by induction on the height of $n_Q$ that
$\mu(n_Q) \subseteq \mu_R(n_Q)$. We first consider the values assigned to match_array during the first pass (the
bottom-up pass) of the algorithm.

• Suppose that $h(n_Q) = 0$. Suppose that $n_X \in \mu(n_Q)$. Then, according to Lemma B.1,
$path(n_X) = path(n_Q)$. Thus, Matches will be called on $n_X$ and $n_Q$. The condition
$c(n_Q)(t(n_X))$ holds. Thus, match_array[$n_Q$, $n_X$] will be assigned the value "true".

• Suppose that $h(n_Q) > 0$ and $n_Q$ is an or-node. Suppose that $n_X \in \mu(n_Q)$. Once again,
according to Lemma B.1, $path(n_X) = path(n_Q)$. Thus, Matches will be called on $n_X$ and
$n_Q$. It also follows that one of the following must hold:

1. The value of $c(n_Q)(t(n_X))$ is "true". Thus, Matches returns true.

2. There is a child $m_Q$ of $n_Q$ and a child $m_X$ of $n_X$ such that $m_X$ matches $m_Q$ and
$m_X \in \mu(m_Q)$. Note that $h(m_Q) < h(n_Q)$. Thus, by the induction hypothesis, the
value of match_array[$m_Q$, $m_X$] after the first pass of the algorithm is "true". Thus
Matches returns true when called on $n_Q$ and $n_X$.

• Suppose that $h(n_Q) > 0$ and $n_Q$ is an and-node. We omit the proof as it is similar to the
previous case.

It is easy to see that it follows from the connectivity of $\mu$ that if $n_X \in \mu(n_Q)$ then the value of
match_array[$n_Q$, $n_X$] will not be changed to "false". Thus, $\mu$ is contained in $\mu_R$ as required. □

We can now prove the theorem required. 

Theorem B.4 (Correctness of Query_Evaluate) Given document X and a query Q, the algo- 
rithm Query_Evaluate retrieves the output set of X w.r.t. Q. 

Proof. Let $X$ be a document and $Q$ be a query. We show that a document node $n_X$ is returned
by Query_Evaluate if and only if $n_X$ is in the output set of $X$ w.r.t. $Q$.

"$\Leftarrow$" Suppose that $n_X$ is in the output set of $X$ w.r.t. $Q$. Then there is a satisfying matching
$\mu$ of $X$ to $Q$ and an output node $n_Q$ in $Q$ such that either $n_X \in \mu(n_Q)$ or $n_X$ is an
ancestor or descendent of a node in $\mu(n_Q)$. Let $\mu_R$ be the retrieval function of Query_Evaluate
w.r.t. $X$ and $Q$. According to Lemma B.3, $\mu$ is contained in $\mu_R$. Thus, clearly $n_X$ is returned
by Query_Evaluate.

"$\Rightarrow$" Suppose that $n_X$ is returned by Query_Evaluate. Let $\mu_R$ be the retrieval function
of Query_Evaluate w.r.t. $X$ and $Q$. Then there is an output node $n_Q$ in $Q$ such that either
$n_X \in \mu_R(n_Q)$ or $n_X$ is an ancestor or descendent of a node in $\mu_R(n_Q)$. According to Lemma B.2,
$\mu_R$ is a satisfying matching of $X$ to $Q$. Thus, $n_X$ is in the output set of $X$ w.r.t. $Q$. □